METHODS AND APPARATUSES
FOR MEASURING DIVERSITY 
IN COMBINATORIAL STRUCTURES 



Inventor: 
Tad Hogg 

BACKGROUND OF THE INVENTION 

1. FIELD OF THE INVENTION

The present invention pertains to the field of automated analysis of combinatorial
structures. Specifically, the present invention involves automated measurement of diversity of
combinatorial structures such as graphs, as can be used to model the Web.

2. DISCUSSION OF THE RELATED ART 

The World Wide Web can be viewed as an ecology with a rich and rapidly evolving set
of relationships among its components. The variety, or diversity, of structures is an important
aspect of ecological systems. Diversity is related to the range of capabilities, the adaptability and
the overall complexity of the ecology. While appealing in concept, diversity is difficult to quantify
and measure for combinatorial structures such as the Web, particularly without resorting to
asymptotic limits that apply only to very large systems.

The World Wide Web consists of a rapidly growing and changing collection of pages. The
relationships among these pages, including explicit hyperlinks, textual similarity, patterns of 
usage and overlapping authorship, form a rich, evolving structure. 

These relationships are most directly useful for finding items relevant for some question,
either by manually following links or through automated searches. However, the relationships can
also be viewed from an ecological perspective. The ecological perspective considers the resulting
large-scale structure and evolution of the Web, which results from the actions of many
autonomous individuals with a variety of goals.

A variety of proposals for precise definitions of diversity, and the related concept of
complexity, have been made for various types of structures. A formal and general definition is
algorithmic complexity, the length of the shortest program that produces the structure. While
algorithmic complexity has good formal properties, it applies only asymptotically to large
structures, and is thus unsuitable for small structures, and it is not computable for individual
structures. Another prior approach, described by B. A. Huberman and T. Hogg, "Complexity and
Adaptation," Physica, 22D:376-384, 1986, defined diversity in terms of the number of distinct
component parts, which is readily computed. However, as defined, it applies only to trees, not
the more general combinatorial structures needed to describe the Web. Approximate entropy is
also easily computed, is readily related to information theory measures asymptotically, and is
useful even for small sequences. However, it applies only to sequences, a very limited type of
structure.

As is apparent from the above discussion, a need exists for an effective and easily
computable measure of diversity for general combinatorial structures of any arbitrary size, such
as graphs which can be used to model Web pages or groups of Web pages.

SUMMARY OF THE INVENTION 
Some conventional measures for diversity of combinatorial structures apply only
asymptotically to large structures and are not directly computable for a given arbitrary structure.
Other conventional measures for diversity, while computable and useful for small structures, do
not apply to general combinatorial structures. An object of the present invention is to develop
a computable measure of diversity useful for general combinatorial structures of any size, such
as graphs which can be used to model Web pages or groups of Web pages.

According to the present invention, a method for computing a diversity measure H(m) for
a combinatorial structure involves identifying all M possible substructures having m elements from
among the n elements of the combinatorial structure C. Thus, M is the number of unique subsets
of m elements from n elements, where m<n, and is equal to n!/[(n-m)!m!]. The number of the
substructures that are similar to each such substructure is determined, and the frequency of each
distinct substructure is calculated using the number of similar substructures and the total number
of substructures M. The method uses the frequency of each distinct substructure to compute an
entropy corresponding to m. By the same process described above, an entropy corresponding
to m+1 is computed. The entropy corresponding to m+1 is subtracted from the entropy
corresponding to m to produce the diversity measure H(m). H(m) is related to the expected
probability that components of size m+1 are similar given that those of size m are similar.

In the preferred embodiment, similar substructures are determined by being identical or
isomorphic after a monotonic renumbering of the substructure elements is performed. In an
alternative embodiment, a distance function is used to compute a distance between two
substructures, and only if the distance is less than a predetermined threshold are the two
substructures determined to be similar. This is particularly useful for graphs having weights
associated with nodes and/or links. In another embodiment, similar substructures are determined
by being identical after a monotonic renumbering of the substructure elements is performed.

In the preferred embodiment, the entropy is computed by summing the frequency of each
distinct substructure multiplied by the logarithm of the frequency of each distinct substructure.
In an alternative embodiment, the entropy is computed by summing the frequency of each distinct
substructure multiplied by the logarithm of the quotient of the frequency divided by an expected
frequency of the distinct substructure. The expected frequency of each distinct substructure either is
determined empirically from observed data, is estimated, or is extracted from a theoretical model.
In one embodiment, frequency times log frequency is summed for all distinct substructures; in
another embodiment, log frequency is summed for all substructures, whether distinct or not, and
the resulting sum is divided by the total number of substructures. Both embodiments yield the
same result, because in the latter case each member of a distinct similarity group is included in the
summation, whereas in the former case only one member of a distinct similarity group is included
in the summation.
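
The equivalence of the two summations can be checked numerically. The following Python sketch is illustrative only: the function names and the sample counts are assumptions introduced here, not part of the disclosed method.

```python
import math

def entropy_over_distinct(group_sizes, total):
    """Sum f*ln(f) over distinct similarity groups, with f = size/total."""
    return sum((size / total) * math.log(size / total) for size in group_sizes)

def entropy_over_all(similarity_counts):
    """Average ln(n_k / M) over all M substructures, distinct or not."""
    total = len(similarity_counts)
    return sum(math.log(n_k / total) for n_k in similarity_counts) / total

# Hypothetical data: four substructures, the first and last similar to each other.
similarity_counts = [2, 1, 1, 2]   # one similarity count per substructure
group_sizes = [2, 1, 1]            # one size per distinct similarity group

phi_distinct = entropy_over_distinct(group_sizes, len(similarity_counts))
phi_all = entropy_over_all(similarity_counts)
```

A group with n_i members contributes n_i copies of ln(n_i/M) to the second sum, and dividing by M turns that into (n_i/M) ln(n_i/M), which is the term the first sum uses.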

In order to ensure the possibility of a reasonable number of instances of each subgraph,
m must be quite small, for example less than O(ln ln n). Random structures are somewhat less
diverse than the maximum possible diversity measure. At the other extreme, highly-ordered
structures have low values of H. Thus, the diversity measure according to the present invention
is able to distinguish diverse structures from ordered or random ones, a useful criterion for
identifying complex structures.

Generalized graphs such as can be used to model the Web are combinatorial structures
suitable for use with the methods according to the present invention. These and other features
and advantages of the present invention are fully described in the Detailed Description of the
Invention and illustrated in the Figures.

BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 illustrates a general purpose computer architecture suitable for implementing the 
methods according to the present invention. 

Figure 2 is a flow chart illustrating a method for computing a diversity measure according
to the present invention.

Figure 3 is a flow chart illustrating a method for performing the step of computing the
numbers n_1 through n_M of substructures which are identical or isomorphic to each other in the
methods of the present invention.

Figure 4 is a flow chart illustrating a method for performing the step of computing the
entropy based upon n_1 through n_M and M in the methods of the present invention.

Figure 5 illustrates a graph having four nodes and its three distinct monotonically
renumbered subgraphs having three nodes.

Figure 6 is a table illustrating the four subgraphs, their monotonically renumbered
equivalents, the distinct similarity group to which each subgraph belongs, and the similarity count
for each subgraph.

Figure 7 illustrates a histogram of the number of graphs with 5 nodes as a function of the
diversity measure H(2) according to the present invention.

Figure 8 illustrates a histogram of the number of trees with 6 nodes as a function of H(2)
as computed according to methods of the present invention.

Figure 9 illustrates another method for performing the step of computing the numbers n_1
through n_M of substructures which are similar to each other in the methods of the present
invention.

Figure 10 is a flow chart illustrating another method for performing the step of computing
the entropy based upon n_1 through n_M and M in the methods of the present invention.

Figure 11 is a flow chart illustrating yet another method for performing the step of
computing the entropy using expected frequencies for distinct substructures and based upon n_1
through n_M and M in the methods of the present invention.

The Figures are more fully described in the Detailed Description of the Invention. 

DETAILED DESCRIPTION OF THE INVENTION
From a practical viewpoint, identifying diverse parts of the Web is useful for finding
starting points for searches or finding a variety of viewpoints on a topic. As agent-based Web
services develop, maintaining diversity is applicable to help ensure a full range of possible services
are considered.

However, a recently introduced concept of approximate entropy is generalized according
to the present invention for application to combinatorial structures such as the Web by examining
the components of the structure, thereby extending a previously developed diversity measure for
trees. See B. A. Huberman and T. Hogg, "Complexity and Adaptation," Physica, 22D:376-384, 1986.
The measure of diversity according to the present invention is particularly well-suited to relatively
small sets of items and is simple to compute.

Diversity is one important aspect of ecosystems, including biological ecosystems and 
social systems (such as the legal and scientific communities) as well as the Web. By maintaining 
a variety of structures and approaches to addressing various problems, diverse systems as a whole 
are robust, or adaptable, in the face of continual changes. Maintaining diversity is also an issue
in computational methods based on this analogy, for example, genetic algorithms. Diversity will 
also be important for agent-based systems that make use of the Web, for example, for electronic 
commerce, to make sure a wide variety of options are considered in searching for the best 
possible transactions. 

While simple in concept, automated use of diversity requires a simple quantitative 
measure. Ideally, a good measure of diversity will: 

1. focus on those aspects of an ecology relevant to desired applications;

2. be easily computable; 

3. allow quantitative comparison among different structures; 

4. apply directly to given, finite-size structures (therefore, not just asymptotically large 
sizes); 

5. distinguish diverse structures from simple ordered or random ones; and

6. apply to the structure itself, independent of the process whereby it was created. 

A new, and broadly applicable, diversity measure according to the present invention is
obtained by extending the notion of approximate entropy to general combinatorial structures
by examining the component parts of the structure. This generalization combines the idea of
examining component parts (extended to any combinatorial structure, not just trees) with the
use of entropy-like functions to obtain a quantitative value with well-known theoretical
properties.

For a given combinatorial structure, such as a graph, the general method of computing 
the new measure according to the present invention is to first identify all the substructures of a 
given small size, and then count the average number of these substructures that are similar to 
each other according to a specified distance measure between the substructures. Because 
there are a variety of ways to define substructures of combinatorial structures and measure 
their similarity, this general method for defining diversity can be instantiated in many ways. 
The particular choices depend on the aspects of the structure that are important for a given 
application. 

Figure 1 illustrates a general purpose computer system 100 suitable for implementing
the methods according to the present invention. The general purpose computer system 100
includes at least a microprocessor 104. The general purpose computer may also include
random access memory 102, ROM memory 103, a keyboard 107, and a modem 108. All of
the elements of the general purpose computer 100 are optionally tied together by a common
bus 101 for transporting data between the various elements. The bus 101 typically includes
data, address, and control signals. Although the general purpose computer 100 illustrated in
Figure 1 includes a single data bus 101 which ties together all of the elements of the general
purpose computer 100, there is no requirement that there be a single communication bus 101
which connects the various elements of the general purpose computer 100. For example, the
microprocessor 104, RAM 102, and ROM 103 are alternatively tied together with a data bus
while the hard disk 105, modem 108, keyboard 107, display monitor 106, and network
interface 109 are connected together with a second data bus (not shown). In this case, the
first data bus 101 and the second data bus (not shown) are linked by a bidirectional bus
interface (not shown). Alternatively, some of the elements, such as the microprocessor 104
and RAM 102, are connected to both the first data bus 101 and the second data bus (not
shown), and communication between the first and second data bus occurs through the
microprocessor 104 and RAM 102. The network interface 109 provides optional
communication capability to a local area network (LAN) using an Ethernet connection, for
example. The modem 108 allows the computer 100 to communicate optionally through the
telephone system. The methods of the present invention are executable on any general
purpose computer system such as the computer system 100 illustrated in Figure 1, but there is
no limitation that this computer system is the only one which can execute the methods of the
present invention.

Figure 2 is a flow chart illustrating a method 200 for computing a diversity measure
H(m) according to the present invention. The diversity function is invoked at step 201. At
step 202, the method identifies the M substructures c_1 through c_M, each having m elements
from among the n elements of C. In other words, all possible combinations of m elements are
selected from the n elements of C. Thus, M is n!/[(n-m)!m!]. In the case of a graph, the
elements are the nodes of the graph, and a link between any particular nodes a and b of a
substructure c_i containing nodes a and b exists only if nodes a and b have a link between them
in the graph C. Thus, each of the M substructures c_1 through c_M has a different set of nodes,
and the links between the nodes of the substructures (subgraphs) are determined by the
connectivity pattern in the original graph C. At step 203, for each one of the substructures c_1
through c_M, a corresponding number (one of n_1 through n_M) is computed by counting the
substructures from among the M substructures that are similar to the one of the substructures
for which the number is being computed. Because each substructure is at least similar to
itself, each one of n_1 through n_M is at least one. Similarity or non-similarity between two
substructures is alternatively determined in any one of a variety of manners according to the
present invention, as is described below.
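
Step 202 can be sketched for the graph case as follows. This is a minimal illustration, assuming an undirected graph given as a list of node pairs on nodes 1..n; the function name and the sample graph are hypothetical.

```python
from itertools import combinations
from math import comb

def induced_substructures(n, edges, m):
    """Return every m-node subset of nodes 1..n with the links it inherits from the graph."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for nodes in combinations(range(1, n + 1), m):
        # A link exists in the substructure only if it exists in the original graph.
        links = {pair for pair in combinations(nodes, 2) if frozenset(pair) in edge_set}
        result.append((nodes, links))
    return result

# Hypothetical 4-node graph: the cycle 1-2-3-4-1.
cycle_edges = [(1, 2), (2, 3), (3, 4), (1, 4)]
subs = induced_substructures(4, cycle_edges, 3)
```

Here `len(subs)` equals `comb(4, 3)`, matching M = n!/[(n-m)!m!].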

At step 204, the entropy Φ(m) is computed based upon the numbers n_1 through n_M and
M itself. If this is the first execution of step 204 since the invocation of the diversity measure
method according to the present invention, then step 205 increments m to m+1. A second
entropy Φ(m+1) is then computed by repeating steps 202 through
204 with m+1 substituted for m. After the second execution of step 204, step 206 computes the
difference between Φ(m) and Φ(m+1) by subtracting Φ(m+1) from Φ(m). The resulting H(m)
is the diversity measure according to the preferred embodiment of the present invention.
However, it is to be noted that entropy Φ(m) as applied to combinatorial structures is itself a
novel measure according to the present invention regardless of whether or not entropy
Φ(m+1) is computed and compared with Φ(m). The method 200 concludes at step 207 with
the diversity measure H(m) being returned as the result of the method.
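
The overall flow of method 200 can be sketched compactly for graphs. The sketch below is one possible reading, not the disclosed implementation: it assumes identity after monotonic renumbering as the similarity test, uses a counter keyed on the renumbered subgraph in place of explicit pairwise comparison, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def renumbered_subgraph(nodes, edge_set):
    """Relabel the sorted nodes as 1..m, keeping only links present in the graph."""
    rank = {node: k for k, node in enumerate(nodes, start=1)}
    return frozenset(
        (rank[a], rank[b])
        for a, b in combinations(nodes, 2)
        if frozenset((a, b)) in edge_set
    )

def phi(n, edges, m):
    """Sum of f*ln(f) over the distinct m-node substructures (steps 202 to 204)."""
    edge_set = {frozenset(e) for e in edges}
    groups = Counter(
        renumbered_subgraph(nodes, edge_set)
        for nodes in combinations(range(1, n + 1), m)
    )
    total = sum(groups.values())  # equals M = n!/[(n-m)!m!]
    return sum((c / total) * math.log(c / total) for c in groups.values())

def diversity(n, edges, m):
    """H(m) = phi(m) - phi(m+1) (steps 205 and 206); requires m + 1 <= n."""
    return phi(n, edges, m) - phi(n, edges, m + 1)

# Hypothetical 4-node cycle 1-2-3-4-1.
cycle_edges = [(1, 2), (2, 3), (3, 4), (1, 4)]
h2 = diversity(4, cycle_edges, 2)
```
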

Figure 3 is a flow chart illustrating a method 300 for performing the step of computing
the numbers n_1 through n_M of substructures which are identical or isomorphic to each other in
the methods of the present invention. Thus, the method 300 is a specific implementation of
the step 203 illustrated in Figure 2. The method starts at step 301, and the substructure
counter number i is initialized to 1 at step 302. At step 303, the m elements of the ith substructure
c_i are monotonically renumbered from 1 to m, as will be described below in conjunction with
Figure 6. At step 304, substructure counter number j is initialized to 1, and the count number
n_i of similar substructures is initialized to 0. At step 305, the m elements of the jth substructure c_j
are monotonically renumbered from 1 to m. At step 306, the monotonically renumbered
substructures c_i and c_j are compared according to some similarity criterion. For example,
strict graph identity is one criterion for determining similarity. This criterion is preferably
made less restrictive by adding graph isomorphism as an acceptable form of similarity.
Two graphs are isomorphic if the nodes can be renumbered in some arbitrary way to produce
identical graphs.

If the two substructures are similar, then step 307 increments the similarity count number n_i by
one. If the two substructures are not similar, then step 308 does not increment the similarity
count number n_i, as c_i and c_j are distinct from each other and thus must be classified in
different distinct similarity groups, as will be described below with regard to Figure 6. In
either case, step 309 determines if the last substructure c_M has been compared to substructure
c_i. If the last substructure c_M has not been compared to c_i, then step 310 increments the
substructure counter j and returns the method to step 305. If the last substructure c_M has been
compared to c_i, then the similarity count number n_i is stored somewhere such as in the table
600 illustrated in Figure 6. Step 312 determines if a similarity count number n_M has been
computed for the last substructure c_M. If the similarity count number n_M has been computed,
the method is complete at step 314 with the return of the similarity count numbers n_1 through
n_M. If the similarity count number n_M has not been computed, then the step 313 increments
the substructure counter i by one and returns the method to step 303.
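
The double loop of method 300 can be sketched as below, assuming strict identity after monotonic renumbering as the similarity criterion. The function names and the four sample substructures (those of a hypothetical cycle 1-2-3-4-1) are assumptions for illustration.

```python
def monotonic_renumber(nodes, links):
    """Relabel nodes as 1..m in increasing order (steps 303 and 305)."""
    rank = {node: k for k, node in enumerate(sorted(nodes), start=1)}
    return frozenset(frozenset((rank[a], rank[b])) for a, b in links)

def similarity_counts(substructures):
    """Return n_1..n_M, comparing every substructure with every substructure."""
    counts = []
    for nodes_i, links_i in substructures:        # outer loop over i
        c_i = monotonic_renumber(nodes_i, links_i)
        n_i = 0                                   # step 304
        for nodes_j, links_j in substructures:    # inner loop over j
            c_j = monotonic_renumber(nodes_j, links_j)
            if c_i == c_j:                        # step 306: identity test
                n_i += 1                          # step 307
        counts.append(n_i)
    return counts

# The four 3-node substructures of a hypothetical cycle 1-2-3-4-1.
substructures = [
    ((1, 2, 3), [(1, 2), (2, 3)]),
    ((1, 2, 4), [(1, 2), (1, 4)]),
    ((1, 3, 4), [(3, 4), (1, 4)]),
    ((2, 3, 4), [(2, 3), (3, 4)]),
]
counts = similarity_counts(substructures)
```

Each substructure is compared with itself as well, so every count is at least one. The sketch renumbers inside the inner loop to mirror the flow chart; caching the renumbered forms would avoid the repeated work.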

It is to be understood that the method 300, as illustrated in Figure 3, is logically how
the similarity count numbers n_1 through n_M are computed; however, various shortcuts and
optimizations of the computation are anticipated and included in the preferred embodiment of
the present invention. For example, a table 600 such as illustrated in Figure 6 is preferably
developed as method 300 is being performed, thereby enabling quick retrieval of results which
would otherwise be repeatedly computed. For example, the first time step 303 or 305 is
performed on any specific substructure c_1 through c_M, the monotonically renumbered
substructure is preferably stored in a table such as illustrated in Figure 6. When subsequent
executions of steps 303 and 305 are performed on the same substructure, the result is merely
looked up in the table 600 rather than being recomputed each time. In this manner, each
substructure c_1 through c_M is monotonically renumbered only one time. Other preferred
optimizations are discussed below.

As a specific useful example, for any graph C with n nodes, the (n choose m) = n!/[(n-m)!m!] =
M subgraphs with m nodes are the substructures identified by the method of the present
invention. All possible combinations of m nodes are identified, and the resulting substructures
are the m nodes as connected in the same manner as in the graph C.

The similarity measure between these subgraphs can be chosen to be subgraph identity,
subgraph isomorphism, or a more relaxed similarity measure. In the case that subgraph
identity is chosen as the similarity measure, each of the subgraphs is identical to only one of
the 2 ^ (m choose 2) different graphs with m nodes. (The symbol ^ is used herein to represent
exponentiation.) Let n_i be the number of subgraphs identical to the ith m-node graph and f_i =
n_i/(n choose m) their relative frequency, for i = 0, ..., 2 ^ (m choose 2) - 1. The entropy of this
set of frequencies is as follows in Equation 1.

$$\Phi(m) = \sum_{i} f_i \ln f_i \tag{1}$$

Finally, the behavior of the structure with respect to components of size m+1 is H(m)
= Φ(m) - Φ(m+1). This definition according to the present invention generalizes the measure
of approximate entropy from sequences to any combinatorial structure. It is related to the
expected probability that components of size m+1 are similar given that those of size m are
similar.

More generally, for an arbitrary combinatorial structure C of size n, the substructures
c_1, c_2, ..., c_M of size m are identified by the method according to the present invention.
Among these substructures, the method then evaluates the frequency with which each distinct
substructure occurs to obtain the relative frequencies f_i. These relative frequencies are then
used in Eq. 1 to evaluate the diversity value for the structure C. Thus, the measure according
to the present invention applies to any structure for which substructures of a given size are
identifiable.

Figure 4 is a flow chart illustrating a method 400 for performing the step of computing
the entropy based upon n_1 through n_M and M in the methods of the present invention. Thus,
the method 400 illustrates how step 204 illustrated in Figure 2 is performed. The method
starts at step 401. At step 402, the sum S is initialized to zero, and the similarity counter i is
initialized to one. At step 403, the frequency f_i is computed by dividing n_i by M. At step 404,
a logarithm of frequency f_i is computed. At step 405, f_i ln f_i is computed by multiplying the
results of steps 403 and 404. At step 406, the sum S is accumulated by adding the result of
step 405 to the current value of sum S. Step 407 determines if all distinct substructures have
been considered. If all distinct substructures have been considered, the method 400 returns
the current value of the sum S as the entropy Φ(m) at step 409. If step 407 determines that
not all distinct substructures have been considered, then step 408 increments i to the next
distinct substructure and returns the method to step 403.

It is to be noted that the substructure counter i is not incremented by one in step 408
but rather is increased by a value which advances i to the next distinct substructure. For
example, if substructures c_i and c_j are similar, then they are deemed to belong to the same
distinct similarity group. In table 600 in Figure 6, c_1 and c_4 belong to the same distinct
similarity group labeled "A". Thus, only one of c_1 and c_4, namely the first one c_1, is taken into
consideration in the computation 400 shown in Figure 4 and described by Equation 1. All
members of the same distinct similarity group necessarily will have the same similarity count
n_i. This is because if c_i is similar to c_j then c_j is similar to c_i, assuming that the same similarity
definition is used. Thus, n_1 and n_4 in table 600 shown in Figure 6 have equal values of 2.
The sum of the frequencies of all distinct substructures is 1, thus the multiplication in step 405
of the logarithm of f_i by the frequency f_i properly weights each logarithm. However, the sum
of all the frequencies f_1 through f_M for all (n choose m) possible substructures (including
non-distinct substructures) is greater than 1 if any substructures are similar.
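
The accumulation of method 400, including the skip to the next distinct substructure, can be sketched as follows. The function name is hypothetical, and the group labels "B" and "C" for the two unshared substructures are assumed for illustration; only group "A" is named in the text.

```python
import math

def entropy_from_counts(counts, group_labels):
    """Accumulate f_i*ln(f_i), visiting one member of each distinct similarity group."""
    total = len(counts)           # M
    s = 0.0                       # step 402
    seen = set()
    for n_i, label in zip(counts, group_labels):
        if label in seen:         # step 408: advance to the next distinct substructure
            continue
        seen.add(label)
        f_i = n_i / total         # step 403
        s += f_i * math.log(f_i)  # steps 404 to 406
    return s

counts = [2, 1, 1, 2]             # n_1..n_4 as described for table 600
labels = ["A", "B", "C", "A"]     # "B" and "C" are assumed labels
phi_3 = entropy_from_counts(counts, labels)

# Frequencies of distinct substructures sum to 1; over all substructures they exceed 1.
first_of_group = {}
for n_i, label in zip(counts, labels):
    first_of_group.setdefault(label, n_i)
distinct_sum = sum(first_of_group.values()) / len(counts)
all_sum = sum(counts) / len(counts)
```
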

Using the method 400 and a table such as shown in Figure 6, it is possible to reduce
the number of table entries (rows) to the number of distinct substructures. For example, a
new row is added each time a new distinct similarity group is detected. In this optimized case,
each time a substructure c_i is determined to be similar to another substructure c_j in step 306,
then step 307 marks the substructure c_j as having been fully considered in the computation. In
this optimized case, step 303 marks substructure c_i as having been fully considered in the
computation. In this modified method 300, steps 310 and 313 are replaced with steps which
advance the respective substructure counters to the next value which has not been marked as
considered in the computation. Similarly in this case, step 304 is modified to initialize j to the
first value which has not been marked as considered in the computation. Also in this case, test
309 is replaced with a test to determine if all unmarked substructures have been used as c_j, and
test 312 is replaced with a test to determine if all substructures have been marked and thus
categorized into a distinct similarity group. In this modified method, step 306 is performed
between M-1 and M(M-1)/2 times, inclusive, depending upon how many distinct similarity
groups actually exist among the substructures c_1 through c_M. In this modified method, if only
one distinct similarity group exists, then all substructures are similar, and step 306 is
performed only M-1 times, resulting in only one table entry having a similarity count n_1 = M.
If all substructures are distinct, then M distinct similarity groups exist, and step 306 is
performed M(M-1)/2 times, resulting in M separate table entries each having a similarity count
of 1. In any case, it is clear from the above modified example that many tradeoffs between
computation and storage can be applied to correctly compute the necessary similarity counts
for each distinct similarity group. Thus, the example shown in Figure 3, in which step 306 is
performed M^2 times, is offered to illustrate a logically simple, but not computationally
efficient or optimal, way to compute the necessary similarity counts for each distinct similarity
group. All such alternative ways for computing the necessary similarity counts and distinct
similarity groups using various forms of method 300 and table 600 are deemed to lie within
the spirit and scope of the present invention.
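
The marking optimization and its comparison bounds can be sketched as below. This is an illustrative reading of the modified method, assuming the substructures have already been monotonically renumbered and that equality of the renumbered forms is the similarity test; the function name is hypothetical.

```python
def group_with_marking(renumbered):
    """Group substructures, skipping marked ones; return (group sizes, comparisons)."""
    total = len(renumbered)
    marked = [False] * total
    sizes = []
    comparisons = 0
    for i in range(total):
        if marked[i]:
            continue                  # already placed in an earlier group
        marked[i] = True              # c_i starts a new distinct similarity group
        size = 1
        for j in range(i + 1, total):
            if marked[j]:
                continue
            comparisons += 1          # one execution of the similarity test
            if renumbered[i] == renumbered[j]:
                marked[j] = True
                size += 1
        sizes.append(size)
    return sizes, comparisons

all_similar = group_with_marking(["x"] * 5)        # one group, M-1 comparisons
all_distinct = group_with_marking(list(range(5)))  # M groups, M(M-1)/2 comparisons
mixed = group_with_marking(["a", "b", "c", "a"])
```

With M = 5, the two extreme cases perform 4 and 10 comparisons respectively, which matches the stated bounds of M-1 and M(M-1)/2.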

Large values of H, for a range of m values, correspond to diverse structures, with
good representation of the full range of possible substructures. To provide a discriminating
measure, m must not be too large. As an extreme, if m = n then every graph would only have
a single component (namely the entire graph) resulting in the same frequency distribution for
every possible graph. In fact, to ensure the possibility of a reasonable number of instances of
each subgraph, m must be quite small, for example less than O(ln ln n).

One interesting property of this entropy-based measure is that the most diverse cases
are relatively rare. Typical, or "random," structures are somewhat less diverse than the
maximum possible diversity measure, as was also observed for diversity measures applied to
trees. At the other extreme, highly ordered structures have low values of H. Thus the
diversity measure according to the present invention is able to distinguish diverse structures
from ordered or random ones, a useful criterion for identifying complex structures.

Another aspect of this measure is that different representations of the same combinatorial
structure can give different values for H, especially for small cases. This property
emphasizes the importance of selecting the component parts and similarity metric appropriate
for a given application. This observation contrasts with the behavior of algorithmic
complexity, which is independent of representation (for asymptotically large structures).

Figure 5 illustrates a graph having four nodes and its three distinct monotonically
renumbered subgraphs having three nodes. As an example, consider a 4-node graph, which
has (4 choose 3) = 4 possible 3-node subgraphs consisting of the following sets of nodes: {1,2,3},
{1,2,4}, {1,3,4}, and {2,3,4}. For the particular graph shown in Figure 5, in the first and last
of these sets, the second node of the subgraph has two edges. The subgraph consisting of
{1,2,4} has the 1st node with two edges, and the subgraph of {1,3,4} has the 3rd node with
two edges. These distinct subgraphs are shown clockwise in the Figure from the upper right.
Thus, the counts for these three subgraphs are 2, 1 and 1, giving f_1 = 1/2, and f_2 = f_3 = 1/4. Thus
Φ(3) = Σ f_i ln f_i = -1.04.
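
The arithmetic behind this value can be written out explicitly from Equation 1 and the three frequencies 1/2, 1/4 and 1/4:

```latex
\begin{aligned}
\Phi(3) &= \tfrac{1}{2}\ln\tfrac{1}{2} + \tfrac{1}{4}\ln\tfrac{1}{4} + \tfrac{1}{4}\ln\tfrac{1}{4} \\
        &= -\tfrac{1}{2}\ln 2 - \tfrac{1}{2}\ln 2 - \tfrac{1}{2}\ln 2 \\
        &= -\tfrac{3}{2}\ln 2 \approx -1.04
\end{aligned}
```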

Figure 6 is a table illustrating the four subgraphs, their monotonically renumbered
equivalents, the distinct similarity group to which each subgraph belongs, and the similarity
count for each subgraph. For substructures c_1 through c_4 of the graph 501 illustrated in
Figure 5, the substructure itself is listed in table 600. The results illustrated in table 600
pertain to a test for identity in step 306 of the method 300. Each substructure entry shows the
three member nodes of the substructure as well as their connectivity. Table 600 also
shows the monotonically renumbered substructure corresponding to each substructure. Each
monotonically renumbered substructure includes nodes which are numbered 1 through m
regardless of which of the n nodes were actually members of the substructure. For example,
because node 3 is not included in substructure c_2, node 4 is renumbered as node 3. Because
node 2 is not included in substructure c_3, nodes 3 and 4 are renumbered as nodes 2 and 3,
respectively. Because node 1 is not included in substructure c_4, nodes 2, 3, and 4 are
renumbered as nodes 1, 2, and 3, respectively. Monotonic renumbering of substructure c_1
produces no change, because substructure c_1 includes the first three elements of the structure.
Because the monotonically renumbered substructures c_1 and c_4 are identical (1-2-3), they
belong to the same distinct similarity group A, since graph identity is specified as the similarity
measure in this example. If graph isomorphism were also included in the similarity
measure, all four substructures c_1 through c_4 would be in the same distinct similarity
group A, because all four substructures consist of a chain of three nodes, and are thus
isomorphic. By contrast, a three-node chain and a three-node fully-connected graph (a triangle)
are not isomorphic. Because the distinct similarity group A has 2 members, namely c_1 and c_4,
the similarity count numbers n_1 and n_4 are 2. The other two substructures c_2 and c_3 are
only similar to themselves, and hence their respective similarity count numbers n_2 and n_3 are
each 1.
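
The renumbering and counting just described can be sketched in a few lines of Python. This is
only an illustrative sketch: Figure 5 is not reproduced here, so the four-node ring used below is
an assumed stand-in for graph 501, chosen because it yields the same groups and counts as the
text. The function names canonical_forms, similarity_counts and phi are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log

def canonical_forms(edges, n, m):
    """Return every m-node subset and its monotonically renumbered link set."""
    subsets = list(combinations(range(1, n + 1), m))
    forms = []
    for nodes in subsets:
        # Monotonic renumbering: the k-th smallest member becomes node k.
        label = {old: new for new, old in enumerate(nodes, start=1)}
        forms.append(frozenset(
            (label[a], label[b]) for a, b in edges if a in label and b in label
        ))
    return subsets, forms

def similarity_counts(forms):
    """n_i = number of substructures identical to c_i after renumbering."""
    size = Counter(forms)
    return [size[form] for form in forms]

def phi(forms):
    """Sum of f ln f over the distinct similarity groups."""
    M = len(forms)
    return sum((k / M) * log(k / M) for k in Counter(forms).values())

ring = [(1, 2), (2, 3), (3, 4), (1, 4)]   # assumption: stand-in for graph 501
subsets, forms = canonical_forms(ring, n=4, m=3)
counts = similarity_counts(forms)
for nodes, form, n_i in zip(subsets, forms, counts):
    print(nodes, sorted(form), n_i)
print(round(phi(forms), 2))
```

Under the ring assumption, the subsets come out in the order c_1 through c_4 used above, only
c_1 and c_4 share a renumbered form, and the final line reproduces the value -1.04.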

Figure 7 illustrates a histogram of the number of graphs with 5 nodes as a function of
the diversity measure H(2) according to the present invention. There are
$2^{\binom{5}{2}} = 2^{10} = 1024$ different 5-node graphs. The range of diversity measures is
easily illustrated for small structures by explicitly enumerating all possible unique instances.
For example, Figure 7 shows the distribution of H(2) for all graphs with 5 nodes. This Figure
illustrates how most of the graphs have moderately large values, but only a few graphs have the
maximum possible value. This behavior continues with graphs of larger sizes.
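
An enumeration of this kind can be sketched as follows. The sketch assumes that identity after
monotonic renumbering is the similarity test and that H(m) is the first entropy minus the second
entropy, as recited in claim 2 and in the abstract; the similarity test behind Figure 7 may differ.
The binning to three decimal places and the names phi and diversity are illustrative choices, not
part of the described method.

```python
from collections import Counter
from itertools import combinations
from math import log

def phi(edges, n, m):
    """Sum of f ln f over distinct m-node substructures (identity test)."""
    forms = []
    for nodes in combinations(range(n), m):
        # combinations() yields sorted tuples, so this renumbering is monotonic.
        label = {old: new for new, old in enumerate(nodes)}
        forms.append(frozenset(
            (label[a], label[b]) for a, b in edges if a in label and b in label
        ))
    M = len(forms)
    return sum((k / M) * log(k / M) for k in Counter(forms).values())

def diversity(edges, n, m):
    """H(m) = Phi(m) - Phi(m+1)."""
    return phi(edges, n, m) - phi(edges, n, m + 1)

n = 5
all_links = list(combinations(range(n), 2))     # the 10 possible links
histogram = Counter()
for mask in range(2 ** len(all_links)):         # one bit per possible link
    edges = [e for bit, e in enumerate(all_links) if (mask >> bit) & 1]
    histogram[round(diversity(edges, n, 2), 3)] += 1

for value in sorted(histogram):
    print(value, histogram[value])
```

The histogram totals 1024 entries, one per 5-node graph. The empty graph and the complete graph
both fall in the bin at zero, because all of their substructures of a given size are identical.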

Figure 8 illustrates a histogram of the number of trees with 6 nodes as a function of
H(2) as computed according to methods of the present invention. There are $n^{n-2}$ possible
n-node trees. Thus, there are $6^{4} = 1296$ possible 6-node trees. A tree is a connected graph
that contains no cycles. While graphs are an
important combinatorial structure, the diversity measure can be applied to more restricted
classes, such as trees as illustrated in Figure 8. This behavior, of a few structures with higher
diversity than typical cases, is analogous to that seen with trees using the previous measure of
diversity described in B. A. Huberman and T. Hogg, Complexity and adaptation, Physica,
22D:376-384, 1986. Thus the new measure can be applied to restricted classes of graphs (for
example, trees) to identify diverse instances within those classes.
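
The count of 6-node trees can be checked by generating them. The sketch below uses Pruefer
sequences, a standard encoding of labeled trees that is not mentioned in this description and is
introduced here only as one convenient way to enumerate them; prufer_to_edges is a hypothetical
helper name.

```python
from itertools import product

def prufer_to_edges(seq, n):
    """Decode a Pruefer sequence of length n-2 into the n-1 links of a labeled tree."""
    degree = [1] * n
    for v in seq:
        degree[v] += 1
    edges = []
    for v in seq:
        # Attach the smallest remaining leaf to the next node in the sequence.
        leaf = min(u for u in range(n) if degree[u] == 1)
        edges.append((min(leaf, v), max(leaf, v)))
        degree[leaf] -= 1
        degree[v] -= 1
    # Exactly two nodes of degree one remain; they form the last link.
    u, w = [x for x in range(n) if degree[x] == 1]
    edges.append((u, w))
    return frozenset(edges)

n = 6
trees = {prufer_to_edges(seq, n) for seq in product(range(n), repeat=n - 2)}
print(len(trees))
```

Each of the 6^4 sequences decodes to a different tree, so the set has 1296 members, and every
member has n-1 = 5 links. The resulting link sets could then be passed to a diversity
computation to build a histogram like that of Figure 8.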

The diversity measure introduced above according to the present invention applies
directly to discrete combinatorial structures. However, in some applications the structure has
additional properties, such as weights associated with various links (for example, related to
their frequency of use). Moreover, it may be useful to allow for errors or incorrectly placed
links rather than requiring exact identity when matching substructures.

The diversity measure according to the present invention is also applicable to
combinatorial structures with continuous-valued properties and errors by a less strict
requirement for similarity among the substructures. Specifically, instead of counting the
number of substructures that are identical to each other, the method counts the number of
substructures that are within a certain threshold of similarity as determined by a distance
function d(x,y). This use of a distance function can also be viewed as a coarse-graining of the
structures during their comparison.

The extended definition proceeds as follows. For a combinatorial structure C of size
n, we again identify the substructures c_1, c_2, ..., c_M of size m, where M is the number of such
substructures. For each substructure c_i, we determine F_i, the fraction of c_1, c_2, ..., c_M whose
distance from c_i is less than a specified threshold δ, as described by Equation 2 below.

$$F_i = \frac{1}{M} \sum_{j=1}^{M} \chi_{ij} \qquad (2)$$

In Equation 2 above, χ_ij = 1 if d(c_i,c_j) < δ and is 0 otherwise. With these values, Eq. 1
generalizes to Equation 3 below.

$$\Phi(m) = \frac{1}{M} \sum_{i=1}^{M} \ln F_i \qquad (3)$$

Figure 9 illustrates another method 900 for performing the step of computing the
numbers n_1 through n_M of substructures which are similar to each other in the methods of the
present invention. Method 900 illustrates an alternative way of performing step 203
illustrated in Figure 2, in accordance with Equation 2 above. The method begins at step 901.
At step 902, the substructure counter i is initialized to one. At step 903, the substructure
counter j is initialized to one, and the similarity count number n_i is initialized to zero. At step
904, a distance function d(c_i,c_j) is computed to express the distance between c_i and c_j as a
single scalar. This distance function is capable of taking into consideration many continuous
or discrete attributes associated with the combinatorial structure, such as link or node weights.
Alternatively, the distance function is capable of providing a similarity definition for graphs
that is less restrictive than graph isomorphism, without requiring the graphs to have any
additional link or node weights associated with them. At step 905, the distance computed in
step 904 is compared to a threshold δ. If the distance is less than the threshold, then the
similarity count n_i is incremented by one at step 906, as the two substructures c_i and c_j are
deemed similar. Step 907 determines if c_i has been compared to all possible c_j. If not, then
step 908 increments j and returns the method 900 to step 904. If so, then step 909 stores n_i in
the table 600 shown in Figure 6. Test 910 determines whether n_i has been computed for all
substructures c_1 through c_M. If more substructures remain, then step 911 increments the
substructure counter i by one and returns the method to step 903. If no more substructures
remain, then step 912 completes the method by returning the similarity counts n_1 through n_M.
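
The double loop of method 900 can be sketched as below, with comments mapping lines to the step
numbers above. The distance function shown, the total absolute difference in link weights, is a
hypothetical example only; the description leaves the choice of d and of the threshold to the
application, and the sample weights are invented for illustration.

```python
def similarity_counts(substructures, distance, threshold):
    """n_i = number of c_j (including c_i itself) with d(c_i, c_j) < threshold."""
    counts = []
    for c_i in substructures:                    # steps 902, 910, 911
        n_i = 0                                  # step 903
        for c_j in substructures:                # steps 907, 908
            if distance(c_i, c_j) < threshold:   # steps 904, 905
                n_i += 1                         # step 906
        counts.append(n_i)                       # step 909
    return counts                                # step 912

def weighted_link_distance(a, b):
    """Hypothetical distance: summed absolute difference of link weights."""
    links = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in links)

# Renumbered 3-node substructures as {link: weight} maps (illustrative values).
subs = [
    {(1, 2): 1.0, (2, 3): 1.0},
    {(1, 2): 0.9, (2, 3): 1.1},
    {(1, 2): 1.0, (1, 3): 1.0},
]
counts = similarity_counts(subs, weighted_link_distance, threshold=0.5)
fractions = [n_i / len(subs) for n_i in counts]  # F_i of Equation 2
print(counts, fractions)
```

With a threshold of 0.5 the first two substructures are deemed similar to each other despite
their slightly different weights, while the third is similar only to itself. Because the
comparison is a strict inequality, the threshold must be positive for a substructure to count
as similar to itself.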

Figure 10 is a flow chart illustrating another method for performing the step of
computing the entropy based upon n_1 through n_M and M in the methods of the present
invention. The method 1000 is an alternative and general way of implementing the step 204
shown in Figure 2, and is described by Equation 3. The method 1000 begins at step 1001. At
step 1002, the sum S is initialized to zero, and the substructure counter i is initialized to one.
At step 1003, the fraction F_i of all substructures that are similar to substructure c_i (including
c_i itself) is computed by dividing the similarity count number n_i by M. The logarithm of
fraction F_i is computed at step 1004. At step 1005, the sum S is accumulated by adding the
logarithm of F_i to the current value of the sum S. Test 1006 determines if all M
substructures have been processed, and if not, step 1007 increments the substructure counter i
by one and returns the method to step 1003. After all M similarity count numbers n_1 through
n_M have been processed, step 1008 computes the entropy Φ(m) by dividing the sum S by M.
The entropy Φ(m) is then returned at step 1009. In contrast to method 400 described above
and by Equation 1 and method 1100 described below and by Equation 4, method 1000 sums
the logarithms of F_i for all i from 1 to M, regardless of the memberships of the various
substructures in any distinct similarity group. Thus, all members of every distinct similarity
group are included in the computation, and therefore step 1005 is always performed M times.
However, because Equation 3 and method 1000 sum log frequency (rather than frequency
times log frequency as in Equation 1 and method 400), the inclusion of non-distinct
substructures in the computation is exactly compensated by the final division by M in step
1008. Thus, method 1000 yields the same result as method 400, but requires more trips
through the summation step and does not require any reference to the distinct similarity group
categorizations of any of the substructures in steps 1006 and 1007.
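
The equivalence of the two summations can be checked numerically on the Figure 6 example. In
this sketch, phi_eq1 and phi_eq3 are hypothetical names, and the group labels "B" and "C" are
assumed here for the two singleton groups, since only group A is named above.

```python
from collections import Counter
from math import isclose, log

def phi_eq1(group_of):
    """Equation 1: sum of f ln f over the distinct similarity groups."""
    M = len(group_of)
    return sum((k / M) * log(k / M) for k in Counter(group_of).values())

def phi_eq3(counts):
    """Equation 3 / method 1000: average of ln F_i over all M substructures."""
    M = len(counts)
    S = 0.0                        # step 1002
    for n_i in counts:             # steps 1006, 1007
        F_i = n_i / M              # step 1003
        S += log(F_i)              # steps 1004, 1005
    return S / M                   # step 1008

group_of = ["A", "B", "C", "A"]            # groups of c_1 through c_4
sizes = Counter(group_of)
counts = [sizes[g] for g in group_of]      # n_i = [2, 1, 1, 2]
print(phi_eq1(group_of), phi_eq3(counts))
```

A group with k members contributes k equal terms (1/M) ln(k/M) to Equation 3, which add up to the
single term (k/M) ln(k/M) of Equation 1. The agreement therefore relies on similarity partitioning
the substructures into groups, as identity and isomorphism do; a thresholded distance need not.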

Another type of application for diversity measures is to compare two sets of
structures, or compare some observed structures to a theoretical model. For example, such a
comparison helps determine if the model adequately captures the relationships in the observed
structures. One way to perform this comparison is to use the diversity measures directly.
Another, more sensitive, method according to the present invention relies on the relative
entropy of the different cases. That is, Equation 1 is replaced with the following Equation 4.

$$\Phi(m) = \sum_i f_i \ln \frac{f_i}{p_i} \qquad (4)$$

In Equation 4 above, p_i is the expected frequency of substructure c_i based on the
theoretical model (or empirically observed in the second set of structures). This function is
zero when f_i = p_i for every i and is otherwise positive.

Figure 11 is a flow chart illustrating yet another method 1100 for performing the step
of computing the entropy using expected frequencies for distinct substructures and based
upon n_1 through n_M and M in the methods of the present invention. The method 1100 is yet
another alternative method for implementing the step 204 shown in Figure 2, and is described
by Equation 4 above. The method 1100 begins at step 1101. At step 1102, the sum S is
initialized to zero, and the substructure counter i is initialized to one. At step 1103, the
frequency f_i of all substructures that are similar to distinct substructure c_i (including c_i itself)
is computed by dividing the similarity count number n_i by M. A quotient q_i is computed at
step 1104 by dividing the frequency f_i by an expected frequency p_i. The logarithm of the
quotient q_i is computed at step 1105. At step 1106, the frequency f_i is multiplied by the
logarithm of quotient q_i. At step 1107, the sum S is accumulated by adding the result of step
1106 to the current value of the sum S. Test 1108 determines if more distinct substructures
exist which have not yet been included in the computation. If so, step 1109 advances the
substructure counter i to the next distinct substructure, and returns the method to step 1103.
If not, step 1110 returns the sum S as the entropy Φ(m).
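
A minimal sketch of method 1100 follows. It assumes that the observed counts are already grouped
by distinct substructure, that every observed group has a positive expected frequency, and that
the expected frequencies sum to one; relative_entropy and the sample values are illustrative and
not taken from this description.

```python
from math import log

def relative_entropy(counts_by_group, expected):
    """Equation 4: sum over distinct groups of f_i ln(f_i / p_i)."""
    M = sum(counts_by_group.values())
    S = 0.0                                  # step 1102
    for group, n_i in counts_by_group.items():
        f_i = n_i / M                        # step 1103
        q_i = f_i / expected[group]          # step 1104
        S += f_i * log(q_i)                  # steps 1105 to 1107
    return S                                 # step 1110

observed = {"A": 2, "B": 1, "C": 1}          # group sizes from the Figure 6 example
matching_model = {"A": 0.5, "B": 0.25, "C": 0.25}
other_model = {"A": 0.25, "B": 0.5, "C": 0.25}
print(relative_entropy(observed, matching_model))
print(relative_entropy(observed, other_model))
```

The first call returns zero because the model frequencies equal the observed frequencies; the
second returns a positive value, reflecting the mismatch between the observations and the model.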

The easily-computed measure of diversity for combinatorial structures according to
the present invention is applicable to a variety of Web-based applications. Some possibilities
include:

1. Automated search methods that require an initial "random" set of Web pages from
which to start searching. These search methods benefit from checking the diversity of this set
using the measure according to the present invention. In particular, it is helpful to avoid
low-diversity starting sets which might tend to focus the search on a limited range of topics.

2. Searches that specifically look for different points of view on a topic, or communities of
different individuals involved in different aspects. Using the diversity measure according to
the present invention helps to identify the broadest possible sampling of distinct views from a
relatively small set of returned items.

3. Agent-based transactions on the Web, including electronic commerce, benefit from
comparing a diverse set of options as measured by the diversity measure according to the
present invention before selecting the best transaction to proceed with or recommend to a
user. The rate at which diversity changes over time as additional Web pages are sampled
suggests when it is reasonable to stop searching. This is an example of the more general
ecological trade-off between exploring for new possibilities and exploiting those already
found.

4. The diversity measure according to the present invention, especially using relative
entropy as described in Equation 4, helps to discriminate among different sets of Web pages.
For example, this determines whether additional search terms or refinements make significant
differences in the variety of pages returned from the search. Reducing the diversity
corresponds to a more tightly focused result, while larger diversity gives a broader variety of
results.
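
As one illustration of the stopping consideration mentioned in item 3, the sketch below flags
when recent changes in measured diversity have become small. The rule, the function name
should_stop, and the window and tolerance parameters are all hypothetical; the description does
not prescribe a particular stopping criterion.

```python
def should_stop(diversity_history, window=3, tolerance=0.01):
    """True once the last `window` changes in diversity are all below `tolerance`."""
    if len(diversity_history) <= window:
        return False                      # not enough samples to judge yet
    recent = diversity_history[-(window + 1):]
    return all(abs(b - a) < tolerance for a, b in zip(recent, recent[1:]))

# Diversity measured after each additional batch of sampled pages (made-up values).
history = [0.10, 0.40, 0.60, 0.605, 0.607, 0.608]
print(should_stop(history))
```

A larger window or a smaller tolerance favors exploration; the opposite settings favor
exploiting what has already been found.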

Beyond these applications, the relation of diversity to complexity and adaptability in
ecological contexts identifies parts of the Web that are most adaptable or unstable with
respect to changes in the environment (for example, new software techniques or user tasks
that spread around the Web). Because models of ecosystems and communication patterns in
organizations can exhibit sudden instabilities as their size, connectivity or variation in use
increases, identifying particularly diverse areas of the Web provides early observations of any
new instabilities or large-scale changes.

More speculatively, designed diversity of aggregate properties of the links helps to
modify the rate of such changes or their ultimate result. Thus examining whether and how
diversity is related to adaptability suggests how often to re-index Web pages based on their
likely rates of change.

Beyond Web-based applications, this new diversity measure according to the present
invention also applies to more general search problems. For instance, maintaining diversity is
an important problem in genetic algorithms that could benefit from a measure of diversity not
just in the number of different individuals in a population but also the structure inherent in
their overlapping behaviors or capabilities. Another example is using diversity to determine the
extent to which a new search problem is likely to be a typical instance of various previously
studied classes of combinatorial searches. Because of the observed regularities in many classes
of search problems (4), such a measure could help determine whether the new problem is
likely to be particularly hard before attempting the search.

While the general diversity measure according to the present invention is helpful, there
are also a number of limitations. First, for particular applications, one still needs to identify
the relevant substructures and an appropriate notion of identity for them (or, alternatively, a
metric and associated threshold). These substructures, in turn, must be feasible to compute.
Second, for empirically observed classes of combinatorial objects there is no analytical
expression for the maximum possible diversity for structures within that class (for example,
due to various, possibly unknown, constraints on the types of structures that appear). In such
cases it is difficult to know whether a particular value of diversity is near the maximum
possible value of diversity. Third, some important aspects of diversity are in the dynamical
evolution of the structures (for example, the history of changes to a Web page) rather than their
static form at any particular time. These aspects are not included in the measure based on the
structure of combinatorial objects. Finally, much of the interesting "higher level" content of
the Web does not appear in purely "syntactic" structures such as the graph of relations among
pages. This issue applies generally to automated information extraction, but with large
collections, statistical techniques that do not explicitly evaluate content are nevertheless quite
successful. This observation suggests that simple measures of structure, including the
diversity measure introduced here, are usefully applied to collections of Web pages.

Although the present invention has been described with respect to its preferred and
alternative embodiments, those embodiments are offered by way of example, not by way of
limitation. It is to be understood that various additions and modifications can be made
without departing from the spirit and scope of the present invention. Accordingly, all such
additions and modifications are deemed to lie within the spirit and scope of the present
invention as set out in the appended claims.

WHAT IS CLAIMED IS:



1. A method for computing a diversity measure for a predetermined combinatorial
structure C having n elements, the method comprising steps of:
    (a) identifying M substructures c_1 through c_M each having m elements from among the n
    elements of the predetermined combinatorial structure C, where M equals n! / [(n-m)! m!];
    (b) for each substructure c_i for i from 1 to M, determining a number n_i of the M
    substructures c_1 through c_M that are similar to the substructure c_i; and
    (c) computing a first entropy Φ(m) based upon all the numbers n_i computed during step
    (b) and based upon M computed in step (a).

2. A method as in claim 1, further comprising the steps of:
    (d) repeating steps (a) and (b) with m+1 substituted for m;
    (e) computing a second entropy Φ(m+1) based upon all the numbers n_i and M computed
    during step (d); and
    (f) subtracting the second entropy Φ(m+1) from the first entropy Φ(m) to produce the
    diversity measure.

3. A method as in claim 2, wherein steps (c) and (e) comprise the steps of:
    for each i from 1 to M:
        computing a fraction F_i by dividing n_i by M; and
        computing a logarithm of fraction F_i;
    computing a sum by adding all logarithms of fractions F_i for i from 1 to M; and
    dividing the sum by M.

4. A method as in claim 2, wherein step (b) comprises the steps of, for each substructure
c_i for i from 1 to M:
    for each substructure c_j for j from 1 to M:
        computing a distance function d(c_i,c_j) representing a measure of a difference
        between substructure c_i and substructure c_j;
        comparing the distance function d(c_i,c_j) to a threshold; and
        determining the substructures c_i and c_j to be similar if and only if the distance
        function d(c_i,c_j) is less than the threshold.

5. A method as in claim 2, wherein steps (c) and (e) comprise the steps of:
    for each distinct substructure c_i:
        computing a frequency f_i by dividing n_i by M;
        computing a logarithm of frequency f_i; and
        computing a product by multiplying the frequency f_i and the logarithm of
        frequency f_i; and
    computing a sum by adding all products of the frequencies f_i and the logarithms of
    frequencies f_i.

6. A method as in claim 2, wherein step (b) comprises the steps of:
    for each substructure c_i for i from 1 to M:
        monotonically renumbering m elements of c_i from 1 to m; and
        for each substructure c_j for j from 1 to M:
            monotonically renumbering m elements of c_j from 1 to m; and
            determining the substructures c_i and c_j to be similar if and only if they are
            identical.

7. A method as in claim 2, wherein step (b) comprises the steps of:
    for each substructure c_i for i from 1 to M:
        monotonically renumbering m elements of c_i from 1 to m; and
        for each substructure c_j for j from 1 to M:
            monotonically renumbering m elements of c_j from 1 to m; and
            determining the substructures c_i and c_j to be similar if and only if they are
            identical or isomorphic.

8. A method as in claim 2, wherein steps (c) and (e) comprise the steps of:
    for each distinct substructure c_i:
        computing a frequency f_i by dividing n_i by M;
        computing a quotient q_i by dividing the frequency f_i by an expected frequency p_i;
        computing a logarithm of quotient q_i; and
        computing a product by multiplying the frequency f_i and the logarithm of
        quotient q_i; and
    computing a sum by adding all products of the frequencies f_i and the logarithms of
    quotients q_i.

9. A method as in claim 2, wherein the predetermined combinatorial structure C
comprises a linked graph, wherein the n elements comprise n nodes.



10. A computer readable storage medium, comprising:
    computer readable program code embodied on said computer readable storage
    medium, said computer readable program code for programming a computer to perform a
    method for computing a diversity measure for a predetermined combinatorial structure C
    having n elements, the method comprising steps of:
    (a) identifying M substructures c_1 through c_M each having m elements from among the n
    elements of the predetermined combinatorial structure C, where M equals n! / [(n-m)! m!];
    (b) for each substructure c_i for i from 1 to M, determining a number n_i of the M
    substructures c_1 through c_M that are similar to the substructure c_i; and
    (c) computing a first entropy Φ(m) based upon all the numbers n_i computed during step
    (b) and based upon M computed in step (a).

11. A computer readable storage medium as in claim 10, the method further comprising
the steps of:
    (d) repeating steps (a) and (b) with m+1 substituted for m;
    (e) computing a second entropy Φ(m+1) based upon all the numbers n_i and M computed
    during step (d); and
    (f) subtracting the second entropy Φ(m+1) from the first entropy Φ(m) to produce the
    diversity measure.

12. A computer readable storage medium as in claim 11, wherein steps (c) and (e)
comprise the steps of:
    for each i from 1 to M:
        computing a fraction F_i by dividing n_i by M; and
        computing a logarithm of fraction F_i;
    computing a sum by adding all logarithms of fractions F_i for i from 1 to M; and
    dividing the sum by M.

13. A computer readable storage medium as in claim 11, wherein step (b) comprises the
steps of, for each substructure c_i for i from 1 to M:
    for each substructure c_j for j from 1 to M:
        computing a distance function d(c_i,c_j) representing a measure of a difference
        between substructure c_i and substructure c_j;
        comparing the distance function d(c_i,c_j) to a threshold; and
        determining the substructures c_i and c_j to be similar if and only if the distance
        function d(c_i,c_j) is less than the threshold.

14. A computer readable storage medium as in claim 11, wherein steps (c) and (e)
comprise the steps of:
    for each distinct substructure c_i:
        computing a frequency f_i by dividing n_i by M;
        computing a logarithm of frequency f_i; and
        computing a product by multiplying the frequency f_i and the logarithm of
        frequency f_i; and
    computing a sum by adding all products of the frequencies f_i and the logarithms of
    frequencies f_i.

15. A computer readable storage medium as in claim 11, wherein step (b) comprises the
steps of:
    for each substructure c_i for i from 1 to M:
        monotonically renumbering m elements of c_i from 1 to m; and
        for each substructure c_j for j from 1 to M:
            monotonically renumbering m elements of c_j from 1 to m; and
            determining the substructures c_i and c_j to be similar if and only if they are
            identical.

16. A computer readable storage medium as in claim 11, wherein step (b) comprises the
steps of:
    for each substructure c_i for i from 1 to M:
        monotonically renumbering m elements of c_i from 1 to m; and
        for each substructure c_j for j from 1 to M:
            monotonically renumbering m elements of c_j from 1 to m; and
            determining the substructures c_i and c_j to be similar if and only if they are
            identical or isomorphic.

17. A computer readable storage medium as in claim 11, wherein steps (c) and (e)
comprise the steps of:
    for each distinct substructure c_i:
        computing a frequency f_i by dividing n_i by M;
        computing a quotient q_i by dividing the frequency f_i by an expected frequency p_i;
        computing a logarithm of quotient q_i; and
        computing a product by multiplying the frequency f_i and the logarithm of
        quotient q_i; and
    computing a sum by adding all products of the frequencies f_i and the logarithms of
    quotients q_i.

18. A computer readable storage medium as in claim 11, wherein the predetermined
combinatorial structure C comprises a linked graph, wherein the n elements comprise n nodes.

19. A computer system, comprising:
    a processor; and
    a processor readable storage medium coupled to the processor having processor
    readable program code embodied on said processor readable storage medium, said processor
    readable program code for programming the computer system to perform a method for
    computing a diversity measure for a predetermined combinatorial structure C having n
    elements, the method comprising steps of:
    (a) identifying M substructures c_1 through c_M each having m elements from among the n
    elements of the predetermined combinatorial structure C, where M equals n! / [(n-m)! m!];
    (b) for each substructure c_i for i from 1 to M, determining a number n_i of the M
    substructures c_1 through c_M that are similar to the substructure c_i; and
    (c) computing a first entropy Φ(m) based upon all the numbers n_i computed during step
    (b) and based upon M computed in step (a).

20. A computer system as in claim 19, the method further comprising the steps of:
    (d) repeating steps (a) and (b) with m+1 substituted for m;
    (e) computing a second entropy Φ(m+1) based upon all the numbers n_i and M computed
    during step (d); and
    (f) subtracting the second entropy Φ(m+1) from the first entropy Φ(m) to produce the
    diversity measure.

21. A computer system as in claim 20, wherein steps (c) and (e) comprise the steps of:
    for each i from 1 to M:
        computing a fraction F_i by dividing n_i by M; and
        computing a logarithm of fraction F_i;
    computing a sum by adding all logarithms of fractions F_i for i from 1 to M; and
    dividing the sum by M.

22. A computer system as in claim 20, wherein step (b) comprises the steps of, for each
substructure c_i for i from 1 to M:
    for each substructure c_j for j from 1 to M:
        computing a distance function d(c_i,c_j) representing a measure of a difference
        between substructure c_i and substructure c_j;
        comparing the distance function d(c_i,c_j) to a threshold; and
        determining the substructures c_i and c_j to be similar if and only if the distance
        function d(c_i,c_j) is less than the threshold.

23. A computer system as in claim 20, wherein steps (c) and (e) comprise the steps of:
    for each distinct substructure c_i:
        computing a frequency f_i by dividing n_i by M;
        computing a logarithm of frequency f_i; and
        computing a product by multiplying the frequency f_i and the logarithm of
        frequency f_i; and
    computing a sum by adding all products of the frequencies f_i and the logarithms of
    frequencies f_i.

24. A computer system as in claim 20, wherein step (b) comprises the steps of:
    for each substructure c_i for i from 1 to M:
        monotonically renumbering m elements of c_i from 1 to m; and
        for each substructure c_j for j from 1 to M:
            monotonically renumbering m elements of c_j from 1 to m; and
            determining the substructures c_i and c_j to be similar if and only if they are
            identical.

25. A computer system as in claim 20, wherein step (b) comprises the steps of:
    for each substructure c_i for i from 1 to M:
        monotonically renumbering m elements of c_i from 1 to m; and
        for each substructure c_j for j from 1 to M:
            monotonically renumbering m elements of c_j from 1 to m; and
            determining the substructures c_i and c_j to be similar if and only if they are
            identical or isomorphic.

26. A computer system as in claim 20, wherein steps (c) and (e) comprise the steps of:
    for each distinct substructure c_i:
        computing a frequency f_i by dividing n_i by M;
        computing a quotient q_i by dividing the frequency f_i by an expected frequency p_i;
        computing a logarithm of quotient q_i; and
        computing a product by multiplying the frequency f_i and the logarithm of
        quotient q_i; and
    computing a sum by adding all products of the frequencies f_i and the logarithms of
    quotients q_i.

27. A computer system as in claim 20, wherein the predetermined combinatorial structure
C comprises a linked graph, wherein the n elements comprise n nodes.

ABSTRACT OF THE DISCLOSURE
A method for computing a diversity measure H(m) for combinatorial structures
involves identifying all M possible substructures having m elements from among the n
elements of the combinatorial structure. The number of the substructures that are similar to
each such substructure is determined, and the frequency of each distinct substructure is
calculated using the number of similar substructures and the total number of substructures M.
The method uses the frequency of each distinct substructure to compute an entropy
corresponding to m. By the same process described above, an entropy corresponding to
m+1 is computed. The entropy corresponding to m+1 is subtracted from the entropy
corresponding to m to produce the diversity measure H(m). In the preferred embodiment,
similar substructures are determined by being identical or isomorphic. In an alternative
embodiment, a distance function is used to compute a distance between two substructures,
and only if the distance is less than a predetermined threshold are the two substructures
determined to be similar. In the preferred embodiment, the entropy is computed by summing
the frequency of each distinct substructure multiplied by the logarithm of the frequency of
each distinct substructure. In an alternative embodiment, the entropy is computed by
summing the frequency of each distinct substructure multiplied by the logarithm of the quotient
of the frequency divided by an expected frequency of the distinct substructure. Generalized
graphs such as can be used to model the Web are combinatorial structures suitable for use with
the methods according to the present invention.